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Abstract 

When  building  a  Chinese  named  entity 
reeognition  system,  one  must  deal  with 
eertain  language-speeifie  issues  sueh  as 
whether  the  model  should  be  based  on 
eharaeters  or  words.  While  there  is  no 
unique  answer  to  this  question,  we  diseuss 
in  detail  advantages  and  disadvantages  of 
eaeh  model,  identify  problems  in  segmen¬ 
tation  and  suggest  possible  solutions,  pre¬ 
senting  our  observations,  analysis,  and 
experimental  results.  The  seeond  topie 
of  this  paper  is  elassifier  eombination. 
We  present  and  deseribe  four  elassifiers 
for  Chinese  named  entity  reeognition  and 
deseribe  various  methods  for  eombining 
their  outputs.  The  results  demonstrate  that 
elassifier  eombination  is  an  effeetive  teeh- 
nique  of  improving  system  performanee: 
experiments  over  a  large  annotated  eorpus 
of  fine-grained  entity  types  exhibit  a  10% 
relative  reduetion  in  F-measure  error. 


1  Introduction 

Named  entity  (NE)  reeognition  has  drawn  mueh  at¬ 
tention  in  reeent  years.  It  was  a  designated  task 
in  a  number  of  eonferenees,  ineluding  the  Mes¬ 
sage  Understanding  Conferenees  (MUC-6,  1995; 
MUC-7,  1997),  the  Information  Retrieval  and  Ex- 
traetion  Conferenee  (IREX,  1999),  the  Conferenees 
on  Natural  Eanguage  Eearning  (Tjong  Kim  Sang, 
2002;  Tjong  Kim  Sang  and  De  Meulder,  2003), 
and  the  reeent  Automatie  Content  Extraetion  Con¬ 
ferenee  (ACE,  2002). 


A  variety  of  algorithms  have  been  proposed  for 
NE  reeognition.  Many  of  these  algorithms  are,  in 
prineiple,  language-independent.  However,  when 
applying  these  algorithms  to  languages  sueh  as 
Chinese  and  Japanese,  we  must  deal  with  eer¬ 
tain  language-speeifie  issues:  for  example,  should 
we  build  a  eharaeter-based  model  or  a  word-based 
model?  how  do  word  segmentation  errors  affeet  NE 
reeognition?  how  should  word  segmentation  and  NE 
reeognition  internet  with  eaeh  other?  Besides  word 
segmentation  related  issues,  Chinese  does  not  have 
eapitalization,  whieh  is  a  very  useful  feature  in  iden¬ 
tifying  NEs  in  languages  sueh  as  English,  Spanish, 
or  Duteh.  How  does  the  laek  of  features  sueh  as  eap¬ 
italization  affeet  the  performanee? 

In  the  first  part  of  this  paper,  we  diseuss  these 
language-speeifie  issues  in  Chinese  NE  reeogni¬ 
tion.  In  partieular,  we  use  a  hidden  Markov  model 
(HMM)  system  as  an  example,  and  diseuss  various 
issues  related  to  applying  the  HMM  elassifier  to  Chi¬ 
nese.  The  HMM  elassifier  is  similar  to  the  one  de- 
seribed  in  (Bikel  et  al.,  1999). 

In  the  seeond  part  of  this  paper,  we  investigate 
the  eombination  of  a  set  of  diverse  NE  reeognition 
elassifiers.  Eour  statistieal  elassifiers  are  eombined 
in  the  experiments,  ineluding  the  above-mentioned 
hidden  Markov  model  elassifier,  a  transformation- 
based  learning  elassifier  (Brill,  1995;  Elorian  and 
Ngai,  2001),  a  maximum  entropy  elassifier  (Ratna- 
parkhi,  1999),  and  a  robust  risk  minimization  elassi¬ 
fier  (Zhang  et  al.,  2002). 

The  remainder  of  this  paper  is  organized  as  fol¬ 
lows:  Seetion  2  deseribes  the  experiment  data,  See- 
tion  3  diseusses  speeifie  issues  related  to  Chinese 
NE  reeognition,  Seetion  4  presents  the  four  elassi¬ 
fiers  and  approaehes  to  eombining  these  elassifiers. 
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2  Data 

We  used  three  annotated  Chinese  eorpora  in  our  ex¬ 
periments. 

The  IBM-FBIS  Corpus 

The  Foreign  Broadcast  Information  Service  (FBIS) 
offers  an  extensive  collection  of  translations  and 
transcriptions  of  open  source  information  monitored 
worldwide  on  diverse  topics  such  as  military  af¬ 
fairs,  politics,  economics,  and  science  and  technol¬ 
ogy.  The  IBM-FBIS  corpus  consists  of  approxi¬ 
mately  3,000  Chinese  articles  obtained  from  FBIS 
(about  3.2  million  Chinese  characters  in  total).  This 
corpus  was  tagged  by  a  native  Chinese  speaker  with 
32  NE  categories,  such  as  person,  location,  organi¬ 
zation,  country,  people,  date,  time,  percentage,  car¬ 
dinal,  ordinal,  product,  substance,  and  salutation. 
There  are  approximately  300,000  NEs  in  the  entire 
corpus,  16%  of  which  are  labeled  as  person,  16%  as 
organization,  and  11%  as  location. 

The  IBM-CT  Corpus 

The  Chinese  Treebank  (Xia  et  ah,  2000),  avail¬ 
able  from  Einguistic  Data  Consortium,  consists  of 
a  100,000  word  (approximately  160,000  characters) 
corpus  annotated  with  word  segmentation,  part-of- 
speech  tags,  and  syntactic  bracketing.  It  includes 
325  articles  from  Xinhua  newswire  between  1994 
and  1998.  The  same  Chinese  annotator  who  worked 
on  the  above  IBM-EBIS  data  also  annotated  the  Chi¬ 
nese  Treebank  data  with  NE  information,  henceforth 
the  IBM-CT  corpus,  using  the  same  32  categories  as 
mentioned  above. 

The  lEER  data 

The  National  Institute  of  Standard  and  Technol¬ 
ogy  organized  the  Information  Extraction  -  Entity 
Recognition  (lEER)  evaluation,  which  involves  en¬ 
tity  recognition  from  textual  information  sources  in 
both  English  and  Mandarin.  The  Mandarin  training 
data  consists  of  approximately  10  hours  of  broad¬ 
cast  news  transcripts  comprised  of  approximately 
390  stories.  The  test  data  also  contains  transcripts  of 
broadcast  news^.  The  training  data  includes  approx¬ 
imately  170,000  characters  and  the  test  data  includes 
approximately  6,500  characters.  Ten  categories  of 
NEs  were  annotated,  such  as  person,  location,  orga¬ 
nization,  date,  duration,  and  measure. 

'other  types  of  test  data  were  also  used  in  lEER  evaluation, 
including  newswire  text  and  real  automatic  speech  recognition 
transcripts,  but  we  did  not  use  them  in  our  experiments. 


3  Language- Specific  Issues  in  Chinese  NE 
Recognition 

Chinese  does  not  have  delimiters  between  words, 
so  a  key  design  issue  in  Chinese  NE  recognition 
is  whether  to  build  a  character-based  model  or  a 
word-based  model.  In  this  section,  we  use  a  hid¬ 
den  Markov  model  NE  recognition  system  as  an  ex¬ 
ample  to  discuss  language-specific  issues  in  Chinese 
NE  recognition. 

3.1  The  Hidden  Markov  Model  Classifier 

NE  recognition  can  be  formulated  as  a  classification 
task,  where  the  goal  is  to  label  each  token  with  a 
tag  indicating  whether  it  belongs  to  a  specific  NE 
or  is  not  part  of  any  NE.  The  HMM  classifier  used 
in  our  experiments  follows  the  algorithm  described 
in  (Bikel  et  ah,  1999).  It  performs  sequence  clas¬ 
sification  by  assigning  each  token  either  one  of  the 
NE  types  or  the  label  “O”  to  represent  “outside  any 
NE”.  The  states  in  the  HMM  are  organized  into  re¬ 
gions,  one  region  for  each  type  of  NE  plus  one  for 
“O”.  Within  each  of  the  regions,  a  statistical  lan¬ 
guage  model  is  used  to  compute  the  likelihood  of 
words  occurring  within  that  region.  The  transition 
probabilities  are  smoothed  by  deleted  interpolation, 
and  the  decoding  is  performed  using  the  Viterbi  al¬ 
gorithm. 

3.2  Character-Based,  Word-Based,  and 
Class-Based  Models 

To  build  a  model  for  identifying  Chinese  NEs,  we 
need  to  determine  the  basic  unit  of  the  model:  char¬ 
acter  or  word.  On  one  hand,  the  word-based  model 
is  attractive  since  it  allows  the  system  to  inspect  a 
larger  window  of  text,  which  may  lead  to  more  in¬ 
formative  decisions.  On  the  other  hand,  a  word  seg- 
menter  is  not  error-prone  and  these  errors  may  prop¬ 
agate  and  result  in  errors  in  NE  recognition. 

Two  systems,  a  character-based  HMM  model  and 
a  word-based  HMM  model,  were  built  for  compar¬ 
ison.  The  word  segmenter  used  in  our  experiments 
relies  on  dictionaries  and  surrounding  words  in  lo¬ 
cal  context  to  determine  the  word  boundaries.  Dur¬ 
ing  training,  the  NE  boundaries  were  provided  to  the 
word  segmenter;  the  latter  is  restricted  to  enforce 
word  boundaries  at  each  entity  boundary.  Therefore, 
at  training  time,  the  word  boundaries  are  consistent 
with  the  entity  boundaries.  At  test  time,  however, 
the  segmenter  could  create  words  which  do  not  agree 
with  the  gold-standard  entity  boundaries. 
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Table  1 :  Performance  of  the  character-based  HMM 
model,  the  word-based  HMM  model,  and  the  class- 
based  HMM  model.  (The  precision,  recall,  and  F- 
measure  presented  in  this  table  and  throughout  this 
paper  are  based  on  correct  identification  of  all  the 
attributes  of  an  NE,  including  boundary,  content, 
and  type.) 

The  performance  of  the  character-based  model 
and  the  word-based  model  are  shown  in  Table  1 .  The 
two  corpora  used  in  the  evaluation,  the  IBM-FBIS 
corpus  and  the  lEER  corpus,  differ  greatly  in  data 
size  and  the  number  of  NE  types.  The  IBM-EBIS 
training  data  consists  of  3.1  million  characters  and 
the  corresponding  test  data  has  270,000  characters. 
As  we  can  see  from  the  table,  for  both  corpora,  the 
character-based  model  outperforms  the  word-based 
model,  with  a  lead  of  3  to  5.5  in  E-measure.  The  per¬ 
formance  gap  between  two  models  is  larger  for  the 
lEER  data  than  for  the  IBM-EBIS  data. 

We  also  built  a  class-based  NE  model.  After  word 
segmentation,  class  tags  such  as  number,  chinese- 
name,  foreign-name,  date,  and  percent  are  used  to 
replace  words  belonging  to  these  classes.  Whether 
a  word  belongs  to  a  specific  class  is  identified  by 
a  rule-based  normalizer.  The  performance  of  fhe 
class-based  HMM  model  is  also  shown  in  Table  1. 
Eor  fhe  IBM-EBIS  corpus,  fhe  class-based  model 
oulperforms  fhe  word-based  model;  for  fhe  lEER 
corpus,  fhe  class-based  model  is  worse  fhan  fhe 
word-based  model.  In  bofh  cases,  fhe  performance 
difference  befween  fhe  word-based  model  and  fhe 
class-based  model  is  very  small.  The  characfer- 
based  model  oulperforms  fhe  class-based  model  in 
bofh  lesls. 

A  more  careful  analysis  indicates  fhal  allhough 
fhe  word-based  model  performs  worse  fhan  fhe 
characler-based  model  overall  in  our  evaluation,  if 
performs  belter  for  cerlain  NE  types.  Eor  inslance, 
Ihe  word-based  model  has  a  belter  performance 
for  Ihe  organization  category  lhan  Ihe  character- 
based  model  in  bolh  lesls.  While  Ihe  character- 
based  model  has  an  E-measure  of  65.07  (IBM-EBIS) 
and  64.76  (lEER)  for  Ihe  organization  category. 


Ihe  word-based  model  achieves  E-measure  scores  of 
69.14  (IBM-EBIS)  and  72.38  (lEER)  respectively. 
One  reason  may  be  lhal  organization  names  tend  lo 
conlain  many  characters,  and  since  Ihe  word-based 
model  allows  Ihe  system  lo  analyze  a  larger  window 
of  lexl,  il  is  more  likely  lo  make  a  correcl  guess. 
We  can  integrate  Ihe  character-based  model  and  Ihe 
word-based  model  by  combining  Ihe  decisions  from 
Ihe  Iwo  models.  Eor  inslance,  if  we  use  Ihe  de¬ 
cisions  of  Ihe  word-based  model  for  Ihe  organiza¬ 
tion  category,  bul  use  Ihe  decisions  of  Ihe  character- 
based  model  for  all  Ihe  olher  categories,  Ihe  over¬ 
all  E-measure  goes  up  lo  76.91  for  Ihe  lEER  dala, 
higher  lhan  using  eilher  Ihe  character-based  or  word- 
based  model  alone.  Anolher  way  lo  integrate  Ihe 
Iwo  models  is  lo  use  a  hybrid  model  -  slarling  wilh 
a  word-based  model  and  backing  off  lo  character- 
based  model  if  Ihe  word  is  unknown. 

3.3  Granularity  of  Word  Segmentation 

We  believe  that  one  main  reason  for  the  lower  per¬ 
formance  of  the  word-based  model  is  that  the  word 
granularity  defined  by  the  word  segmenter  is  not 
suitable  for  the  HMM  model  to  perform  the  NE 
recognition  task.  What  exactly  constitutes  a  Chinese 
word  has  been  a  topic  of  major  debate.  We  are  in¬ 
terested  in  what  is  the  best  word  granularity  for  our 
particular  task. 

To  illustrate  the  word  granularity  problem  for  NE 
tagging,  we  take  person  names  as  an  example.  Our 
word  segmenter  marks  a  person’s  name  as  one  word, 
consistent  with  the  convention  used  by  the  Chinese 
treebank  and  many  other  word  segmentation  sys¬ 
tems.  While  this  may  be  useful  in  other  applications, 
it  is  certainly  not  a  good  choice  for  our  NE  model. 
Chinese  names  typically  contain  two  or  three  char¬ 
acters,  with  family  name  preceding  first  name.  Only 
a  limited  set  of  characters  are  used  as  family  names, 
while  the  first  name  can  be  any  character(s).  There¬ 
fore,  the  family  name  is  a  very  important  and  use¬ 
ful  feature  in  identifying  an  NE  in  the  person  cate¬ 
gory.  By  combining  the  family  name  and  the  first 
name  into  one  word,  this  important  feature  is  lost  to 
the  word-based  model.  In  our  tests,  the  word-based 
model  performs  much  worse  for  the  person  category 
than  the  character-based  model.  We  believe  that,  for 
the  purpose  of  NE  recognition,  it  is  better  to  separate 
the  family  name  from  the  first  name  in  word  segmen¬ 
tation,  although  this  is  not  the  convention  used  in  the 
Chinese  treebank. 

Other  examples  include  the  segmentation  of 


words  indicating  dates,  countries,  locations,  percent¬ 
ages,  measures,  and  ordinals.  For  instance,  “July 
4th”  is  expressed  by  four  characters  “7th  month  4th 
day”  in  Chinese.  The  word  segmenter  marks  the 
four  characters  as  a  single  word;  however,  the  sec¬ 
ond  and  the  last  character  are  actually  good  features 
for  indicating  date,  since  the  dates  are  usually  ex¬ 
pressed  using  the  same  structure  (e.g.,  “March  25th” 
is  expressed  by  “3rd  month  25th  day”  in  Chinese). 
For  reasons  similar  to  the  above,  we  believe  that  it 
is  better  to  separate  characters  representing  “month” 
and  “day”,  rather  than  combining  the  four  charac¬ 
ters  into  one  word.  A  similar  problem  can  be  ob¬ 
served  in  English  with  tokens  such  as  “61 -year-old 
man”  if  one  is  interested  in  identifying  a  person’s 
age,  in  which  case  ’year’  and  ’old’  are  good  features 
for  predication. 

The  above  analysis  suggests  that  a  better  way  to 
apply  a  word  segmenter  in  an  NE  system  is  to  first 
adapt  the  segmenter  so  that  the  segmentation  granu¬ 
larity  is  more  appropriate  to  the  particular  task  and 
model.  As  a  guideline,  characters  that  are  good  fea¬ 
tures  for  identifying  NEs  should  not  be  combined 
with  other  characters  into  word.  Additional  ex¬ 
amples  include  characters  expressing  “percent”  and 
characters  representing  monetary  measures  . 

3.4  The  Effect  of  Segmentation  Errors 

Word  segmentation  errors  can  lead  to  mistakes  in 
NE  recognition.  Suppose  an  NE  consists  of  four 
characters  ((71,(72,(73,(74),  if  the  word  segmen¬ 
tation  merges  (7i  with  a  character  preceding  it, 
then  this  NE  cannot  be  correctly  identified  by  fhe 
word-based  model  since  fhe  boundary  will  be  incor- 
recf.  Besides  inducing  NE  boundary  errors,  incor- 
recf  word  segmenfafion  also  leads  fo  wrong  mafch- 
ings  befween  fraining  examples  and  fesfing  exam¬ 
ples,  which  may  resulf  in  misfakes  in  identifying  en¬ 
tities. 

We  computed  fhe  upper  bound  for  fhe  word-based 
model  for  fhe  IBM-EBIS  fesf  presenfed  in  Table  1. 
The  upper  bound  of  performance  is  compufed  by 
dividing  fhe  fofal  number  of  NEs  whose  bound¬ 
aries  are  also  recognized  as  boundaries  by  fhe  word 
segmenter  by  fhe  fofal  number  of  NEs  in  fhe  cor¬ 
pus,  which  is  fhe  precision,  recall,  and  also  fhe  E- 
measure.  Eor  fhe  IBM-EBIS  fesf  dafa  in  Table  1, 
fhe  upper  bound  of  fhe  word-based  model  is  95.7  E- 
measure. 

We  also  did  fhe  following  experimenf  fo  measure 
fhe  effecl  of  word  segmenfafion  errors:  we  gave 


fhe  boundaries  of  NEs  in  fhe  fesf  dafa  fo  fhe  word 
segmenter  and  forced  if  fo  mark  enfify  boundaries 
as  word  boundaries.  This  eliminafes  fhe  word  seg¬ 
menfafion  errors  fhaf  inevifably  resulf  in  NE  bound¬ 
ary  errors.  Eor  fhe  IBM-EBIS  dafa,  fhe  word-based 
HMM  model  achieves  76.60  E-measure  when  fhe 
enfify  boundaries  in  fhe  fesf  dafa  are  given,  and  fhe 
class-based  model  achieves  77.77  E-measure,  higher 
fhan  fhe  77.19  E-measure  by  fhe  character-based 
model  in  Table  1.  Eor  fhe  lEER  dafa,  fhe  E-measure 
of  fhe  word-based  model  improves  from  70.83  fo 
73.74  when  fhe  enfify  boundaries  are  given,  and  fhe 
class-based  model  improves  from  70.20  fo  72.47. 

This  suggesfs  fhaf  wifh  fhe  improvemenf  in  Chi¬ 
nese  word  segmenfafion,  fhe  word-based  model  may 
achieve  comparable  or  belter  performance  fhan  fhe 
characler-based  model. 

3.5  Lexical  Features 

Capitalization  in  English  gives  good  evidence  of 
names.  Our  HMM  classifier  for  English  uses  a  set 
of  word-features  to  indicate  whether  a  word  con¬ 
tains  all  capitalized  letters,  only  digits,  or  capital¬ 
ized  letters  and  period,  as  described  in  (Bikel  et  ah, 
1999).  However,  Chinese  does  not  have  capitaliza¬ 
tion.  When  we  applied  the  HMM  system  to  Chinese, 
we  retained  such  features  since  Chinese  text  also  in¬ 
clude  digits  and  roman  words  (such  as  in  product 
or  company  names).  In  an  attempt  to  investigate  the 
usefulness  of  such  features  for  Chinese,  we  removed 
them  from  the  system  and  observed  very  little  dif¬ 
ference  in  overall  performance  (0.4  difference  in  E- 
measure). 

3.6  Sensitivity  to  Corpus  and  Training  Size 
Variation 

To  test  the  robustness  of  the  model,  we  trained  the 
system  on  the  100,000  word  IBM-CT  data  and  tested 
on  the  same  IBM-EBIS  data.  The  character-based 
model  achieves  61.36  E-measure  and  the  word- 
based  model  achieves  58.40  E-measure,  compared 
to  77.19  and  74.17,  respectively,  using  the  20  times 
larger  IBM-EBIS  training  set.  This  represents  an  ap¬ 
proximately  20%  relative  reduction  in  performance 
when  trained  on  a  related  yet  different  and  consider¬ 
ably  smaller  training  set.  We  plan  to  investigate  fur¬ 
ther  the  relation  between  corpus  type  and  size  and 
performance. 


4  Classifier  Combination 

This  section  investigates  the  combination  of  a  set  of 
classifiers  for  NE  recognition.  We  first  introduce  the 
classifiers  used  in  our  experimenfs  and  fhen  describe 
fhe  combinafion  mefhods. 

4.1  The  Classifiers 

Besides  fhe  HMM  classifier  menfioned  in  fhe  previ¬ 
ous  section,  fhe  following  fhree  classifiers  were  used 
in  fhe  experimenfs. 

4.1.1  The  Transformation-Based  Learning 
(fnTBL)  Classifier 

Transformation-based  learning  is  an  error-driven 
algorifhm  which  has  fwo  major  sfeps:  if  sfarfs  by 
assigning  some  classification  fo  each  example,  and 
fhen  aufomafically  proposing,  evaluating  and  selecf- 
ing  fhe  classificafion  changes  fhaf  maximally  de¬ 
crease  fhe  number  of  errors. 

TBL  has  some  affracfive  qualifies  fhaf  make  if 
suifable  for  fhe  language-relafed  fasks:  if  can  au¬ 
fomafically  infegrafe  heferogeneous  fypes  of  knowl¬ 
edge,  wifhouf  fhe  need  for  explicif  modeling  (similar 
fo  Snow  (Dagan  el  ah,  1997),  Maximum  Enfropy, 
decision  frees,  elc);  if  is  error-driven,  Ihus  direclly 
minimizes  fhe  ullimale  evaluafion  measure:  fhe  er¬ 
ror  rale.  The  TBE  loolkil  used  in  Ihis  experimenl  is 
described  in  (Elorian  and  Ngai,  2001). 

4.1.2  The  Maximum  Entropy  Classifier 
(MaxEnt) 

The  model  used  here  is  based  on  fhe  maxi¬ 
mum  enfropy  model  used  for  shallow  parsing  (Raf- 
naparkhi,  1999).  A  sentence  wilh  NE  lags  is 
converted  info  a  shallow  free:  lokens  nof  in  any 
NE  are  assigned  an  “O”  lag,  while  lokens  wilhin 
an  NE  are  represenfed  as  consliluenls  whose  la¬ 
bel  is  fhe  same  as  fhe  NE  lype.  Eor  exam¬ 
ple,  fhe  annolafed  senlence  “I  will  fly  fo  (EO- 
CATION  New  York)  (DATEREE  lomorrow)”  is 
represenfed  as  a  free  “(S  I/O  will/0  lly/0  lo/O 
(EOCATION  New/EOCATION  York/EOCATION) 
(DATEREE  lomorrow/DATEREE)  )”.  Once  an  NE 
is  represenfed  as  a  shallow  free,  NE  recognilion  can 
be  realized  by  performing  shallow  parsing. 

We  use  fhe  fagging  and  chunking  model  described 
in  (Ralnaparkhi,  1999)  for  shallow  parsing.  In  fhe 
fagging  model,  fhe  confexl  consisls  of  a  window  of 
five  lokens  (including  fhe  token  being  lagged  and 
fwo  tokens  fo  ils  lefl  and  fwo  tokens  fo  ils  righl)  and 
fwo  lags  fo  fhe  lefl  of  fhe  currenl  token.  Eive  groups 


of  fealure  lemplales  are  used:  token  unigram,  loken 
bigram,  loken  Irigram,  lag  unigram  and  lag  bigram 
(all  wilhin  fhe  confexl  window).  In  fhe  chunking 
model,  fhe  confexl  is  limited  fo  a  window  of  fhree 
sublrees:  fhe  previous,  currenl  and  nexl  sublree.  Un¬ 
igram  and  bigram  chunk  (or  lag)  labels  are  used  as 
fealures. 

4.1.3  The  Robust  Risk  Minimization  (RRM) 
Classifier 

This  syslem  is  a  varianl  of  fhe  lexl  chunking 
system  described  in  Zhang  el  al.  (2002),  where  fhe 
NE  recognilion  problem  is  regarded  as  a  sequenlial 
token-based  fagging  problem.  We  denole  by  {wi} 
(i  =  0,1, ...  ,m)  fhe  sequence  of  lokenized  lexl, 
which  is  fhe  inpuf  fo  our  syslem.  In  loken-based 
fagging,  fhe  goal  is  fo  assign  a  class-label  /*  to  ev¬ 
ery  loken  Wi. 

In  our  system.  Ibis  is  achieved  by  eslimaling  fhe 
conditional  probabilily  P(fj  =  c\xi)  for  every  pos¬ 
sible  class-label  value  c,  where  x*  is  a  fealure  vector 
associated  wilh  loken  i.  The  fealure  vecfor  x*  can 
depend  on  previously  predicted  class-labels 
bul  fhe  dependency  is  lypically  assumed  fo  be  lo¬ 
cal.  Given  such  a  condilional  probabilily  model, 
in  fhe  decoding  slage,  we  eslimafe  fhe  besl  possi¬ 
ble  sequence  of  f*’s  using  a  dynamic  programming 
approach. 

In  our  syslem,  fhe  condilional  probabilily  model 
has  fhe  following  paramelric  form: 

P{ti  =  c\xi,  {ti-i, . . .  ,f*-i})  =  T{w'^Xi  +  be), 

where  T{y)  =  min(l,  max(0,  y))  is  fhe  Iruncalion 
of  y  into  fhe  interval  [0, 1].  rUg  is  a  linear  weighl 
vector  and  be  is  a  conslanl.  Paramelers  Wc  and  be 
can  be  estimated  from  fhe  Iraining  dala. 

This  classificafion  melhod  is  based  on  approxi- 
malely  minimizing  fhe  risk  funclion.  The  general¬ 
ized  Winnow  melhod  used  in  (Zhang  el  ah,  2002) 
implemenls  such  a  robust  risk  minimization  melhod. 

4.1.4  Performance  of  the  Classifiers 

We  compared  the  performance  of  the  four  classi¬ 
fiers  by  training  and  testing  them  on  the  same  data 
sets.  We  divided  the  IBM-EBIS  corpus  into  three 
subsets:  2.8  million  characters  for  training,  330,000 
characters  for  development  testing,  and  330,000 
characters  for  testing.  Table  2  shows  the  results  of 
each  classifier  for  the  development  test  set  and  the 
evaluation  set.  The  RRM  and  fnTBE  classifiers  are 
the  best  performers  for  the  test  set,  followed  by  Max¬ 
Ent.  The  HMM  classifier  lags  behind  by  around  6 


Development  Test 

Test 

Precision 

Recall 

E-measure 

Precision 

Recall 

E-measure 

Baselinel 

17.52% 

24.92% 

20.58 

14.04% 

20.52% 

16.67 

Baseline2 

77.86  % 

45.26% 

57.24 

71.06% 

40.22% 

51.37 

HMM 

71.38% 

80.92% 

75.85 

71.73% 

77.55% 

74.53 

MaxEnt 

83.82% 

78.08% 

80.85 

85.43% 

75.22% 

80.00 

fnTBE 

76.85% 

81.55% 

80.08 

82.06% 

80.68% 

81.36 

RRM 

82.83% 

81.06% 

81.89 

84.53% 

78.64% 

81.48 

Table  2:  Baselines  and  performanee  of  the  four  elassifiers. 


points  in  F-measure  from  the  best  system.  The  pre¬ 
sented  results  are  for  eharaeter-based  models. 

For  eomparison,  we  also  eomputed  two  baselines: 
one  in  whieh  eaeh  eharaeter  is  labeled  with  its  most 
frequent  label  (Baselinel  in  Table  2),  and  one  in 
whieh  eaeh  entity  that  was  seen  in  training  data  is 
labeled  with  its  most  frequent  elassifieation  (Base- 
line2  in  Table  2  -  this  baseline  is  eomputed  using 
the  software  provided  with  the  CoNLL-2003  shared 
task  (Tjong  Kim  Sang  and  De  Meulder,  2003)). 

4.2  Combination 

The  four  elassifiers  differ  in  multiple  dimensions, 
making  them  good  eandidates  for  eombination.  We 
explored  various  ways  to  eombine  the  results  from 
the  four  elassifiers. 

4.2.1  Combination  algorithms 

For  the  first  part  of  the  experimental  setup,  we 
eonsider  the  following  elassifieation  framework: 
given  the  (probabilistie)  output  Pr*  (•  \w)  of  n  elassi¬ 
fiers  Cl, ,  Cn,  the  elassifier  eombination  problem 
ean  be  viewed  a  probability  interpolation  problem 
-  eompute  the  elass  probability  distribution  eondi- 
tioned  on  the  joint  elassifier  output: 

P  (C\w,  CD  =  /  {{Pn  {C\w,  Q)},^i,„„)  (1) 

where  Ci  is  the  i*  elassifier’s  output,  w  is  an  ob¬ 
servable  eontext  (e.g.,  a  word  trigram)  and  /  is  a 
eombination  funetion.  A  eommonly  used  eombining 
seheme  is  through  linear  interpolation  of  the  elassi¬ 
fiers’  elass  probability  distributions: 

n 

p{c\w,c^)  «  5^p((:7|m,i,a)-p(i|m) 
^=1 
n 

=  (2) 
i=i 

The  weights  A*  (w)  eneode  the  importanee  given  to 
elassifier  i  in  eombination,  for  the  eontext  w,  and 


Pi  {C\w,  Ci)  is  an  estimation  of  the  probability  that 
the  eorreet  elassifieation  is  C,  given  that  the  output 
of  the  elassifier  i  on  eontext  w  is  Ci.  These  param¬ 
eters  from  Equation  (2)  ean  be  estimated,  if  needed, 
on  development  data. 

Table  3  presents  the  eombination  results,  for 
different  ways  of  estimating  the  interpolation  pa¬ 
rameters.  A  simple  eombination  method  is  the 
equal  voting  method  (van  Halteren  et  al.,  2001; 
Tjong  Kim  Sang  et  al.,  2000),  where  the  parameters 
are  eomputed  as  A*  {w)  =  A  and  P*  {C\w,Ci)  = 
S  {C,  Ci),  S  being  the  Kroneeker  operator: 

^  I  g  X  ^  y 

In  other  words,  eaeh  of  the  elassifiers  votes  with 
equal  weight  for  the  elass  that  is  most  likely  under 
its  model,  and  the  elass  reeeiving  the  largest  num¬ 
ber  of  votes  wins  (i.e.,  it  is  seleeted  as  the  elassifi¬ 
eation  output).  However,  this  proeedure  may  lead  to 
ties,  where  some  elassifieations  reeeive  an  identieal 
number  of  votes  -  one  usually  resorts  to  randomly 
seleeting  one  of  the  tied  eandidates  in  this  ease  -  Ta¬ 
ble  3  presents  the  average  results  obtained  by  this 
method,  together  with  the  varianee  obtained  over  30 
trials.  To  make  the  deeision  deterministieally,  the 
weights  assoeiated  with  the  elassifiers  ean  be  eho- 
sen  as  A*  (w)  =  1  —  Pj  (error).  In  this  method, 
presented  in  Table  3  as  weighted  voting,  better  per¬ 
forming  elassifiers  will  have  a  higher  impaet  on  the 
final  elassifieation. 

In  the  previously  deseribed  methods,  also  known 
as  voting,  eaeh  elassifier  gave  its  entire  vote  to  one 
elassifieation  -  its  own  output.  However,  Equa¬ 
tion  (2)  allows  for  elassifiers  fo  give  partial  eredit 
to  alternative  elassifieations,  through  the  probabil¬ 
ity  P*  ( (7  |ru,  Q).  In  the  experiments  deseribed  here, 
this  value  is  eomputed  direetly  on  the  development 
data.  However,  the  spaee  of  possible  ehoiees  for  C, 
w  and  Ci  is  large  enough  to  make  the  estimation 


Precision 

Recall 

F-measure 

Best  Classifier 

84.53 

78.64 

81.48 

Equal  weight  voting 

85.2±0.03 

81.7±0.02 

83.3±0.02 

Weighted  voting 

86.04 

80.63 

83.24 

Model  1 

83.07 

82.10 

82.58 

Model  2 

86.3 

77.55 

81.69 

RRM 

89.01 

79.61 

84.05 

RRM+  flags 

88.96 

79.84 

84.18 

RRM+IOB1+IOB2 

88.5 

80.46 

84.29 

Oracle 

91.6 

92.3 

91.95 

Table  3:  Classifier  eombination  results. 


unreliable,  so  we  use  two  approximations,  named 
Model  1  and  Model  2  in  Table  3:  P*  {C\w,  Ci)  = 
Pi  {C\woY  and  Pi  {C\w,Ci)  =  Pi  {C\Ci),  respee- 
tively.  Both  probability  distributions  are  estimated 
as  smoothed  relative  frequeneies  on  the  develop¬ 
ment  data.  Interestingly,  both  methods  underper¬ 
form  the  equal  voting  method,  a  faet  whieh  ean  be 
explained  by  inspeeting  the  results  in  Table  2:  the 
fnTBL  method  has  an  aeeuraey  (eomputed  on  devel¬ 
opment  data)  lower  than  the  MaxEnt  aeeuraey,  but 
it  outperforms  the  latter  on  the  test  data.  Sinee  the 
parameters  P  {C\w,Ci)  are  eomputed  on  the  devel¬ 
opment  data,  they  are  probably  favoring  the  Max¬ 
Ent  method,  resulting  in  lower  performanee.  On  the 
other  hand,  the  equal  voting  method  does  not  suffer 
from  this  problem,  as  its  parameters  are  not  depen¬ 
dent  on  the  development  data.  In  a  last  set  of  exper¬ 
iments,  we  extend  the  elassifieation  framework  to  a 
larger  spaee,  in  whieh  we  eompute  the  eonditional 
class  probability  distribution  conditioned  on  an  ar¬ 
bitrary  set  of  features: 

r(c|/f)=/({p.(ck,/f)}_^J  (3) 

This  setup  allows  us  to  still  use  the  classifications 
of  individual  systems  as  features,  but  also  allows  for 
other  types  of  conditioning  features  -  for  instance, 
one  can  use  the  output  of  any  classifier  (e.g.,  POS 
fags,  fext  chunk  labels,  etc)  as  features. 

In  the  described  experiments,  we  use  the  RRM 
method  to  compute  the  function  /  in  Equation  (3), 
allowing  the  system  to  select  a  good  performing 
combination  of  features.  At  training  time,  the  sys¬ 
tem  was  fed  the  output  of  each  classifier  on  the  de¬ 
velopment  data,  and,  in  subsequent  experiments,  the 
system  was  also  fed  a  flag  stream  which  briefly  iden¬ 
tifies  some  of  the  tokens  (numbers,  romanized  char- 

^wo  is  the  word  associated  with  the  context  w. 


acters,  etc)  and  the  output  of  each  system  in  a  differ¬ 
ent  NE  encoding  scheme. 

In  all  the  voting  experiments,  the  NEs  were  en¬ 
coded  in  an  lOBl  scheme,  since  it  seems  to  be  the 
most  amenable  to  combination.  Briefly,  the  lOB 
general  encoding  scheme  associates  a  label  with 
each  word,  corresponding  to  whether  it  begins  a  spe¬ 
cific  entity,  continues  the  entity,  or  is  outside  any  en¬ 
tity.  Tjong  Kim  Sang  and  Veenstra  (1999)  describes 
in  detail  the  lOB  schemes.  The  final  experiment  also 
has  access  to  the  output  of  systems  trained  in  the 
IOB2  encoding.  The  addition  of  each  feature  type 
resulted  in  better  performance,  with  the  final  resulf 
yielding  a  10%  relative  decrease  of  E-measure  error 
when  compared  with  the  best  performing  system^. 
Table  3  also  includes  an  upper-bound  on  the  classi¬ 
fier  combination  performance  -  the  performance  of 
the  switch  oracle,  which  selects  the  correct  classifi¬ 
cation  if  at  least  one  classifier  outputs  it. 

Table  3  shows  that,  at  least  for  the  examined 
types  of  combination,  using  a  robust  feature -based 
classifier  to  compute  the  classification  distribution 
yields  better  performance  than  combining  the  clas¬ 
sifications  through  either  voting  or  weighted  inter¬ 
polation.  The  RRM-based  classifier  is  able  to  in¬ 
corporate  heterogenous  information  from  multiple 
sources,  obtaining  a  2.8  absolute  E-measure  im¬ 
provement  versus  the  best  performing  classifier  and 
1.0  E-measure  gain  over  the  next  best  method. 

5  Related  Work 

Sun  et  al.  (2002)  proposes  to  use  a  class-based 
model  for  Chinese  NE  recognition.  Specifically,  it 
uses  a  character-based  trigram  model  for  the  class 
person,  a  word-based  model  for  the  class  location, 
and  a  more  complicated  model  for  the  class  organi¬ 
zation.  This  decision  is  consistent  with  our  observa¬ 
tion  that  the  character-based  model  performs  better 
than  the  word-based  model  for  classes  such  as  per¬ 
son,  but  is  worse  for  classes  such  as  organization. 

Sekine  and  Eriguchi  (2000)  provides  an  overview 
of  Japanese  NE  recognition.  It  presents  the  re¬ 
sults  of  15  systems  that  participated  in  an  evalua¬ 
tion  project  for  Information  Retrieval  and  Informa¬ 
tion  Extraction  (IREX,  1999).  Utsuro  et  al.  (2002) 
studies  combining  outputs  of  multiple  Japanese  NE 
systems  by  stacking.  A  second  stage  classifier  -  in 
fhis  case,  a  decision  lisf  -  is  trained  to  combine  the 
outputs  from  first  stage  classifiers.  This  is  similar 

^Measured  as  Fe  =  1  —  F. 


in  spirit  to  our  application  of  the  RRM  classifier  for 
combining  classifier  outputs. 

Classifier  combination  has  been  shown 
to  be  effective  in  improving  the  perfor¬ 
mance  of  NLP  applications,  and  have  been 
investigated  by  Brill  and  Wu  (1998)  and 
van  Halteren  et  al.  (2001)  for  part-of-speech  tag¬ 
ging,  Tjong  Kim  Sang  et  al.  (2000)  for  base  noun 
phrase  chunking,  and  Florian  et  al.  (2003a)  for 
word  sense  disambiguation.  Among  the  investigated 
techniques  were  voting,  probability  interpolation, 
and  classifier  stacking.  We  also  applied  the  classifier 
combination  technique  discussed  in  this  paper  to 
English  and  German  (Florian  et  ah,  2003b). 

6  Conclusion 

In  this  paper,  we  discuss  two  topics  related  to  Chi¬ 
nese  NE  recognition:  dealing  with  language-specific 
issues  such  as  word  segmentation,  and  combining 
multiple  classifiers  to  enhance  the  system  perfor¬ 
mance.  In  the  described  experiments,  the  character- 
based  model  consistently  outperforms  the  word- 
based  model  -  one  major  reason  for  this  fact  is  that 
the  segmentation  granularity  might  not  be  suited  for 
this  particular  task.  Combining  four  statistical  clas¬ 
sifiers,  including  a  hidden  Markov  model  classifier, 
a  transformation-based  learning  classifier,  a  maxi¬ 
mum  entropy  classifier,  and  a  robust  risk  minimiza¬ 
tion  classifier,  significantly  improves  the  system  per¬ 
formance,  yielding  a  10%  relative  reduction  in  F- 
measure  error  over  the  best  performing  classifier. 
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